Introduction to Open Data Science 2018
The aim of this course is to demonstrate the main principles of data handling, statistical methods and data visualization. By the end of the course, students will be able to apply the principles of reproducible research. In addition, you will see the possibilities and opportunities that programming offers.
Data is everywhere; learn how to use it to your advantage!
The data models the relationship between learning approaches and students' achievement in an introductory statistics course in Finland. There are seven variables in total: "gender", "Age", "Attitude", "Deep", "Stra", "Surf" and "Points", the last of which measures achievement.
The data has 166 observations (students) and 7 variables, i.e. a table with 166 rows and 7 columns.
Variables:
Age; Age (in years) derived from the date of birth
Attitude; Global attitude toward statistics
Points; Exam points
gender; Gender: M (Male), F (Female)
Deep; Deep approach ~ d_sm + d_ri + d_ue, where d_sm = Seeking Meaning, d_ri = Relating Ideas, d_ue = Use of Evidence
Surf; Surface approach ~ su_lp + su_um + su_sb, where su_lp = Lack of Purpose, su_um = Unrelated Memorising, su_sb = Syllabus-boundness
Stra; Strategic approach ~ st_os + st_tm, where st_os = Organized Studying, st_tm = Time Management
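The three approach scores above are composites of Likert-scale survey items. A minimal sketch of how such a composite could be computed with rowMeans (the item column names d_sm, d_ri, d_ue follow the descriptions above, but the exact raw-data column names are an assumption here):

```r
# Hedged sketch: building a composite score like Deep from raw survey items.
# The item column names (d_sm, d_ri, d_ue) are assumed for illustration.
raw <- data.frame(d_sm = c(4, 3), d_ri = c(5, 2), d_ue = c(3, 4))
raw$deep <- rowMeans(raw[, c("d_sm", "d_ri", "d_ue")])  # mean of item scores
raw$deep
```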
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
summary(data_stud)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
boxplot(split(data_stud$points,data_stud$gender),main='Points by Gender')
boxplot(split(data_stud$attitude,data_stud$gender),main='Attitude by Gender')
boxplot(split(data_stud$deep,data_stud$gender),main='Deep by Gender')
boxplot(split(data_stud$stra,data_stud$gender),main='Stra by Gender')
boxplot(split(data_stud$surf,data_stud$gender),main='Surf by Gender')
pairs(~points+age+attitude+deep+stra+surf,data=data_stud,
main="Scatterplot Matrix")
library(corrplot)
## corrplot 0.84 loaded
data_stud$gender <- factor(data_stud$gender)
num_data <- data_stud[,c("points","age","attitude","deep","stra","surf")]
res <- cor(num_data)
corrplot(as.matrix(res), method="circle")
From the correlation plot (circles colored from red to blue), we can see that Surf and Deep have a strong negative correlation: students who lack motivation do not seek new ideas or meaning from statistics.
The strategic approach (Stra) correlates positively with all the other variables except Surf.
I chose three explanatory variables: attitude, stra and age.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
l_model <- lm(points ~ attitude + stra + age, data=data_stud)
summary(l_model)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = data_stud)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Multiple linear regression models the relationship between the explanatory variables (attitude, stra, age) and the response variable (points) by fitting a linear equation to the observed data.
As expected, attitude is strongly statistically significant in the model output: its p-value is clearly under 0.001 (4.7*10^-9, to be exact). The strategic approach (stra) is borderline: its p-value (0.062) is just above the conventional 0.05 threshold, so it is a suggestive but not significant explanator of students' points. Age is the weakest explanator, with a p-value of about 0.098.
Although some of the explanatory variables are statistically significant, the R-squared of the whole model stays rather low: 0.22 (adjusted 0.20).
Model: Points_est = 10.89543 + 3.48077*attitude + 1.00371*stra - 0.08822*age
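As a sanity check, the fitted equation can be reproduced by hand and compared with predict() for a hypothetical student (the chosen values attitude = 3.2, stra = 3.0, age = 22 are just an example):

```r
# Sketch: verify the model equation by hand against predict().
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep = ",")
l_model <- lm(points ~ attitude + stra + age, data = data_stud)
b <- coef(l_model)
manual <- b["(Intercept)"] + b["attitude"]*3.2 + b["stra"]*3.0 + b["age"]*22
new_stud <- data.frame(attitude = 3.2, stra = 3.0, age = 22)
predict(l_model, newdata = new_stud)  # agrees with 'manual' (about 23.1 points)
```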
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
l_model <- lm(points ~ attitude + stra, data=data_stud)
summary(l_model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = data_stud)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The model got worse without the age variable: the adjusted R-squared dropped to 0.195 (from 0.204). Conclusion: continue with attitude, stra and age.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
l_model <- lm(points ~ attitude + stra + age, data=data_stud)
plot(l_model)
The aim of the diagnostic plots is to show how well our multiple linear model actually fits the data.
The Residuals vs Fitted scatterplot shows the model residuals against the predicted values. If the residuals are small and show no pattern, the model predicts well; a wide, sparse scatter indicates poor predictions. Remember, our model's R^2 was only 0.22.
The Normal Q-Q plot is a normal probability plot: if the model errors follow a normal distribution, the points fall on a straight line. In our case there is slight divergence from normality at both ends, so extremely low and high point totals are harder to predict with the model.
The last plot (Residuals vs Leverage) describes the influence of individual observations on the model. Observations 56, 4 and 2 have the most influence. We could go back to the original data and check whether something is wrong with these rows, or try to work out why these particular students have such different combinations of points, attitude, stra and age; they clearly do not follow the trend.
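The individual diagnostics discussed above can also be requested one at a time with the `which` argument of plot.lm, instead of cycling through all default plots:

```r
# Sketch: draw only the three diagnostics discussed above.
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep = ",")
l_model <- lm(points ~ attitude + stra + age, data = data_stud)
plot(l_model, which = 1)  # Residuals vs Fitted
plot(l_model, which = 2)  # Normal Q-Q
plot(l_model, which = 5)  # Residuals vs Leverage (influential obs. 2, 4, 56)
```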
The data has 382 observations (students) and 35 variables, i.e. a table with 382 rows and 35 columns.
The data describes student achievement in two Portuguese schools. The attributes give information on the students' social background and studying habits, and there are also several variables describing overall living habits. The data covers student performance in two subjects: Mathematics (mat) and Portuguese language (por).
Article: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
summary(alc_data)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.442
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.034 Mean :0.2906
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.00 Min. :1.000 Min. :1.000
## yes:364 yes:121 1st Qu.:4.00 1st Qu.:3.000 1st Qu.:2.000
## Median :4.00 Median :3.000 Median :3.000
## Mean :3.94 Mean :3.223 Mean :3.113
## 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.00 Max. :5.000 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.00 Min. :1.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:3.000 1st Qu.: 0.000
## Median :1.000 Median :2.00 Median :4.000 Median : 3.000
## Mean :1.474 Mean :2.28 Mean :3.579 Mean : 5.319
## 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000 3rd Qu.: 8.000
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :75.000
## G1 G2 G3 alc_use
## Min. : 3.00 Min. : 0.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 8.00 1st Qu.: 8.25 1st Qu.: 8.00 1st Qu.:1.000
## Median :10.50 Median :11.00 Median :11.00 Median :1.500
## Mean :10.86 Mean :10.71 Mean :10.39 Mean :1.877
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :19.00 Max. :19.00 Max. :20.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:270
## TRUE :112
##
##
##
1. Sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
It is well known that males tend to drink more alcohol than females (especially in "macho cultures"). Hypothesis: alcohol consumption is generally higher in the "M" class, so being male increases the odds of high alcohol use.
2. Studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
Students who are willing to study and work hard generally do not have time for drinking, and drinking alcohol decreases a student's motivation to study. Hypothesis: alcohol consumption decreases as study time rises, i.e. high study time decreases the odds of high alcohol consumption.
3. Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
I suggest that the father's job and social status correlate strongly with the children's achievements. The father's own alcohol consumption might correlate even more clearly with the student's drinking habits: a child learns from example! Hypothesis: the higher the father's social status, the less the children drink and pick up other bad habits; the father's job decreases the student's odds of high alcohol consumption.
4. Higher - wants to take higher education (binary: yes or no)
When students are ambitious about their studies, they are less likely to spend their time drinking alcohol. Hypothesis: answering "yes" to this variable decreases the odds of drinking alcohol.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gather(alc_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+ geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
library(tidyr); library(dplyr); library(ggplot2)
ggplot() +
geom_density(data=alc_data, aes(x=alc_use, group=sex, fill=sex),alpha=0.5, adjust=2) +
xlab("Alcohol use (1-5)") +
ylab("Density")
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$sex <- factor(alc_data$sex)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$sex)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## alc_data$sex
## alc_data$high_use F M Total
## ------------------------------------------
## FALSE 157 113 270
## 2.078 2.236
## 0.581 0.419 0.707
## 0.793 0.614
## 0.411 0.296
## ------------------------------------------
## TRUE 41 71 112
## 5.009 5.390
## 0.366 0.634 0.293
## 0.207 0.386
## 0.107 0.186
## ------------------------------------------
## Total 198 184 382
## 0.518 0.482
## ==========================================
There seems to be a fairly clear division in alcohol consumption between males and females. Female alcohol consumption is concentrated in classes 1 and 2, whereas male drinking is spread more evenly across all the consumption classes. It seems to be really rare to find Portuguese females who report themselves in alcohol use classes 4-5.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$studytime <- factor(alc_data$studytime)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() +
geom_density(data=alc_data, aes(x=alc_use, group=studytime, color=studytime),alpha=1, adjust=2, size=2) +
xlab("Alcohol use (1-5)") +
ylab("Density")
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$studytime <- factor(alc_data$studytime)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$studytime)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================================
## alc_data$studytime
## alc_data$high_use 1 2 3 4 Total
## ----------------------------------------------------------
## FALSE 60 133 54 23 270
## 2.251 0.012 2.364 0.804
## 0.222 0.493 0.200 0.085 0.707
## 0.583 0.700 0.871 0.852
## 0.157 0.348 0.141 0.060
## ----------------------------------------------------------
## TRUE 43 57 8 4 112
## 5.426 0.030 5.699 1.937
## 0.384 0.509 0.071 0.036 0.293
## 0.417 0.300 0.129 0.148
## 0.113 0.149 0.021 0.010
## ----------------------------------------------------------
## Total 103 190 62 27 382
## 0.270 0.497 0.162 0.071
## ==========================================================
Study time correlates with alcohol consumption: as study time rises, the probability of heavy alcohol use decreases. Students who spend exceptionally large amounts of time studying (classes 3 and 4) report almost no heavy drinking, placing mostly in classes 1-2. Interestingly, students in study time class 3 do not participate in heavy drinking (class 5) at all; these people could be described as "middle-of-the-road" students.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$Fjob <- factor(alc_data$Fjob)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() +
geom_density(data=alc_data, aes(x=alc_use, group=Fjob, color=Fjob),alpha=1, adjust=2, size=2) +
xlab("Alcohol use (1-5)") +
ylab("Density")
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$Fjob <- factor(alc_data$Fjob)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$Fjob)
## Warning in chisq.test(tab, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================================================
## alc_data$Fjob
## alc_data$high_use at_home health other services teacher Total
## --------------------------------------------------------------------------
## FALSE 14 13 149 69 25 270
## 0.640 0.081 0.000 0.581 0.435
## 0.052 0.048 0.552 0.256 0.093 0.707
## 0.875 0.765 0.706 0.645 0.806
## 0.037 0.034 0.390 0.181 0.065
## --------------------------------------------------------------------------
## TRUE 2 4 62 38 6 112
## 1.544 0.194 0.000 1.400 1.050
## 0.018 0.036 0.554 0.339 0.054 0.293
## 0.125 0.235 0.294 0.355 0.194
## 0.005 0.010 0.162 0.099 0.016
## --------------------------------------------------------------------------
## Total 16 17 211 107 31 382
## 0.042 0.045 0.552 0.280 0.081
## ==========================================================================
Contrary to my earlier hypothesis, the father's job does not seem to have a major influence on the student's alcohol consumption: all of the job classes are divided fairly identically. An interesting feature of the father's job attribute is the "health" class, which has a high peak at alcohol use 1 but also a minor peak at classes 3-4. This could indicate that students whose fathers work in health care related jobs tend to use much less alcohol in general, but that there are a few rebellious students who stand against their fathers' health doctrines.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$higher <- factor(alc_data$higher)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() +
geom_density(data=alc_data, aes(x=alc_use, group=higher, color=higher),alpha=1, adjust=2, size=2) +
xlab("Alcohol use (1-5)") +
ylab("Density")
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$higher <- factor(alc_data$higher)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$higher)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## alc_data$higher
## alc_data$high_use no yes Total
## ------------------------------------------
## FALSE 9 261 270
## 1.089 0.054
## 0.033 0.967 0.707
## 0.500 0.717
## 0.024 0.683
## ------------------------------------------
## TRUE 9 103 112
## 2.626 0.130
## 0.080 0.920 0.293
## 0.500 0.283
## 0.024 0.270
## ------------------------------------------
## Total 18 364 382
## 0.047 0.953
## ==========================================
A student's ambition for higher education does indeed seem to have a slight influence on drinking habits: higher education seems to attract students who use less alcohol than students with no motivation for higher education. The alcohol use distribution inside the "no" group is quite incoherent and spread more evenly than in the "yes" group. There are also students who do not use alcohol but do not have plans for higher education either.
Logistic regression is traditionally used to model the relationship between predictors (continuous, categorical or a mix of both) and a binary categorical response y (0/1). The model describes the probability (or odds) of y taking the value 1 given a set of predictors: log(p/(1-p)) = b0 + b1*x1 + ... + bk*xk. In our case, we model the probability of a student having high alcohol consumption (1) with the predictors sex, study time, father's job and ambition for higher education.
Before applying the logistic regression model, we split the data into training and testing sets. I chose the first 300 students as the training set, so the 82 remaining students are left for testing the model.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
test <- alc_data[301:382,]
dim(train)
## [1] 300  35
dim(test)
## [1] 82 35
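Taking the first 300 rows assumes the row order is essentially random. If the file happens to be ordered (e.g. by school), a random split would be safer; a minimal sketch using sample():

```r
# Alternative sketch: a random train/test split instead of the first 300 rows.
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep = ",")
set.seed(123)                       # make the split reproducible
idx <- sample(nrow(alc_data), 300)  # 300 random row indices
train <- alc_data[idx, ]
test  <- alc_data[-idx, ]
dim(train); dim(test)               # 300 x 35 and 82 x 35
```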
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)
summary(model)
##
## Call:
## glm(formula = high_use ~ sex + studytime + Fjob + higher, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2661 -0.9208 -0.6262 1.0912 2.2832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8926 1.2056 -1.570 0.11645
## sexM 0.5782 0.2841 2.035 0.04183 *
## studytime2 -0.2667 0.3023 -0.882 0.37766
## studytime3 -1.7878 0.6629 -2.697 0.00699 **
## studytime4 -1.4421 0.6781 -2.127 0.03344 *
## Fjobhealth 1.4410 1.2415 1.161 0.24576
## Fjobother 1.2687 1.0925 1.161 0.24551
## Fjobservices 1.8110 1.1074 1.635 0.10199
## Fjobteacher 0.3926 1.2103 0.324 0.74566
## higheryes -0.2904 0.5513 -0.527 0.59834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 361.29 on 299 degrees of freedom
## Residual deviance: 329.60 on 290 degrees of freedom
## AIC: 349.6
##
## Number of Fisher Scoring iterations: 5
From the logistic model summary table we can see that the predictors sex and studytime are statistically significant in the model, although studytime is significant only for two of its levels (3 and 4).
Sex: according to our model, being male increases the log-odds of high alcohol use by 0.58 compared to females (odds ratio about 1.78). The p-value, 0.04183, is slightly under the 0.05 threshold.
Study time: increasing study time does indeed lower the odds of heavy drinking; however, the effect really kicks in when a student moves from level 2 to level 3. Put simply, students who study 5 to 10 hours or more per week have considerably smaller odds of high alcohol use. The p-value for studytime3 is 0.00699 (the lowest in the model) and for studytime4 it is 0.03344.
Father's job: none of the predictor levels is statistically significant. The teacher level nevertheless has the smallest estimated coefficient among the Fjob terms, hinting that students whose fathers work as teachers might be less likely to become heavy drinkers.
Pursuit of higher education: not statistically significant, but having ambition for higher education lowers the log-odds of high alcohol use by 0.29.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)
anova(model)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 299 361.29
## sex 1 8.4311 298 352.86
## studytime 3 13.9372 295 338.92
## Fjob 4 9.0447 291 329.88
## higher 1 0.2747 290 329.60
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model)
## llh llhNull G2 McFadden r2ML
## -164.80162560 -180.64550478 31.68775835 0.08770702 0.10023878
## r2CU
## 0.14317798
The analysis of deviance (ANOVA) table describes how much better our logistic model predicts high alcohol use compared with the null model, which uses only the intercept.
The table shows that the largest drop in deviance comes from studytime. In contrast, the students' ambition for higher education makes only a small difference to the model (a deviance change of 0.27).
Furthermore, model fit can be assessed with McFadden's pseudo R^2, which is only 0.088 for our model.
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept).
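McFadden's pseudo R^2 can also be computed by hand from the deviances reported in the model summary above (deviance is -2 times the log-likelihood, so the deviance ratio equals the log-likelihood ratio):

```r
# Sketch: McFadden's pseudo R^2 from the reported deviances.
null_dev  <- 361.29  # null deviance (intercept-only model)
resid_dev <- 329.60  # residual deviance of our model
1 - resid_dev / null_dev  # about 0.088, matching pR2(model)["McFadden"]
```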
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)
exp(cbind("Odds ratio" = coef(model), confint.default(model, level = 0.95)))
## Odds ratio 2.5 % 97.5 %
## (Intercept) 0.1506773 0.01418480 1.6005605
## sexM 1.7829081 1.02160476 3.1115372
## studytime2 0.7658945 0.42348388 1.3851635
## studytime3 0.1673226 0.04563802 0.6134542
## studytime4 0.2364228 0.06258757 0.8930806
## Fjobhealth 4.2250405 0.37072282 48.1517881
## Fjobother 3.5563949 0.41790531 30.2650962
## Fjobservices 6.1164281 0.69798147 53.5984042
## Fjobteacher 1.4807783 0.13814056 15.8729946
## higheryes 0.7479459 0.25386006 2.2036669
The table above shows the odds ratio for each model predictor. Being male multiplies the odds of high alcohol consumption by about 1.78, and all of the studytime levels (2-4) decrease the odds. The confidence intervals of the odds ratios are based on asymptotic normality.
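The 82 held-out students from the earlier split can now be used to check how well the model actually predicts. A hedged sketch (the 0.5 classification threshold is a conventional choice, not something fixed by the model):

```r
# Sketch: evaluate the fitted model on the held-out test set.
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep = ",")
train <- alc_data[1:300, ]
test  <- alc_data[301:382, ]
# studytime was modelled as a factor; use the same levels in both sets
train$studytime <- factor(train$studytime, levels = 1:4)
test$studytime  <- factor(test$studytime,  levels = 1:4)
model <- glm(high_use ~ sex + studytime + Fjob + higher,
             family = binomial(link = "logit"), data = train)
prob <- predict(model, newdata = test, type = "response")
pred <- prob > 0.5                            # classify with a 0.5 threshold
table(observed = test$high_use, predicted = pred)
mean(pred == test$high_use)                   # proportion of correct predictions
```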
This exercise focuses on how to classify and cluster data into categories according to its characteristics. Clustering methods try to find similarities in the data and, on the other hand, the attributes that divide it.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
dim(Boston)
## [1] 506 14
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
The "Boston" dataset describes social and environmental characteristics of the city of Boston. It has 506 rows (observations) and 14 columns (attributes).
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s
Variable explanations: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
library(MASS); library("GGally")
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
data(Boston)
Boston$chas <- factor(Boston$chas)
lowerFn <- function(data, mapping, ...) {
p <- ggplot(data = data, mapping = mapping) +
geom_point(color = 'blue', alpha=0.3, size=1) +
geom_smooth(color = 'black', method='lm', size=1,...)
p
}
g <- ggpairs(
data = Boston,
lower = list(
continuous = wrap(lowerFn) #wrap("smooth", alpha = 0.3, color = "blue", lwd=1)
),
upper = list(continuous = wrap("cor", size = 4))
)
g <- g + theme(
axis.text = element_text(size = 4),
axis.title = element_text(size = 4),
legend.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = NA),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey95")
)
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(MASS);
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The data consists mainly of continuous variables; the only categorical variable is the "CHAS" attribute, which takes binary values (0/1): CHAS equals 1 if the tract bounds the Charles River, 0 otherwise. The other variables are ratios or proportions describing the city districts: social, economic and environmental factors affecting the price of houses and apartments in Boston. The graphical overview reveals several strong correlations between the variables; in this interpretation I will discuss only the major trends in the data.
The strongest correlation in the data can be found between MEDV (house price) and LSTAT (% lower status of pop.) - correlation of -0.74. As the lower status pop. increases the median housing price decreases rapidly.
RM (average number of rooms per dwelling) has a strong positive correlation (0.70) with MEDV. This is intuitive: more rooms (and space) means more expensive housing. PTRATIO (pupil-teacher ratio by town) has a strong negative correlation (-0.51) with MEDV. The attribute reflects the overall social status of a town: in less popular and poorer parts of a city there are often fewer funds to offer schools, so class sizes tend to grow, while socially "better" areas have private schools with a lower pupil-teacher ratio.
Other correlations between the study variables worth mentioning: AGE (proportion of owner-occupied units built prior to 1940) correlates with both LSTAT and DIS (weighted distance to five Boston employment centres). The lower-status population tends to live in older houses (built before the 1940s), simply because they cannot afford newer ones; note that AGE and DIS are negatively correlated, so these older houses lie close to the employment centres rather than far from them.
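The correlations discussed above can be checked numerically (a sketch; this recomputes the correlation matrix directly from the MASS Boston data):

```r
library(MASS)
data(Boston)
# correlations of selected predictors with the median house price
round(cor(Boston)[c("lstat", "rm", "ptratio", "age"), "medv"], 2)
# lstat vs medv should be about -0.74 and rm vs medv about 0.70, as above
round(cor(Boston$age, Boston$dis), 2)  # negative: older housing stock lies closer to the centres
```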
The scale function standardizes each column of the data by subtracting the column mean and dividing by the column standard deviation.
library(MASS);
data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
From the summary we can recognize the result of standardization: the minimum value of every variable is negative, since all values below the column mean are now below zero. The mean of every variable is exactly 0, and each column has standard deviation 1, so the magnitude of a standardized value tells how extreme the observation is relative to the column's spread.
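A quick numerical check of these properties (each column of the scaled data should have mean 0 and standard deviation 1):

```r
library(MASS)
data(Boston)
boston_scaled <- scale(Boston)
# column means are zero up to floating point error
round(colMeans(boston_scaled), 10)
# column standard deviations are exactly 1
apply(boston_scaled, 2, sd)
```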
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
levels(crime)
## [1] "low" "med_low" "med_high" "high"
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
In the code above, the crim variable has been transformed into a categorical variable with 4 levels using the quantiles of standardized crim. We divided the data into 4 groups: the first group holds values below the first quartile (25 %), the second group values between the first quartile and the median, and so on. I renamed the levels to something more descriptive. A quantile split always produces (nearly) equal-sized classes regardless of standardization; here low and high have 127 observations because the 506 rows cannot be divided by 4 exactly.
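The quantile-based binning is easy to illustrate with a toy vector (a hypothetical example, not part of the assignment data):

```r
x <- 1:8
bins <- quantile(x)  # the 0 %, 25 %, 50 %, 75 % and 100 % quantiles
# include.lowest = TRUE keeps the minimum inside the first class
crime_demo <- cut(x, breaks = bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(crime_demo)    # two observations fall into each of the four classes
```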
In order to build proper predictive models, we need to split our data into training and test sets. In our case, we split the data into a training set (80 %) and a test set (20 %) with a random sample and remove the crime variable from the test set. Next we fit a linear discriminant analysis with crime as the target variable and all other variables in the dataset as predictors. LDA is a classification method that aims to find linear combinations of the variables that best separate the target variable classes (the crime levels).
# boston_scaled is available
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- dim(boston_scaled)[1]
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.269802 0.250000 0.230198 0.250000
##
## Group means:
## zn indus chas nox rm
## low 1.01149275 -0.9041165 -0.1278483251 -0.8837112 0.4624630
## med_low -0.05019056 -0.3381096 0.0395204559 -0.5716425 -0.0787730
## med_high -0.36644653 0.2175185 0.2356838659 0.4951792 0.1070958
## high -0.48724019 1.0149946 0.0005392655 1.0391462 -0.3805584
## age dis rad tax ptratio
## low -0.8771076 0.8495517 -0.6857985 -0.7646398 -0.38585272
## med_low -0.3192240 0.3797467 -0.5384037 -0.4578098 -0.03799923
## med_high 0.5094399 -0.4456141 -0.4557992 -0.3414618 -0.50047014
## high 0.8035902 -0.8567003 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38110814 -0.78611638 0.56442189
## med_low 0.31206388 -0.14857511 0.02280804
## med_high 0.06915263 0.05167016 0.21541261
## high -0.67010018 0.81546327 -0.68501269
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12744564 0.57789314 -1.07179809
## indus 0.07777846 -0.22933493 0.05521080
## chas -0.07521587 -0.02866796 0.15741680
## nox 0.17602658 -0.80450764 -1.14805110
## rm -0.16487470 -0.09883885 -0.08391882
## age 0.17504622 -0.40745764 -0.05776690
## dis -0.15787695 -0.30063291 0.39539937
## rad 4.09952460 0.92556772 -0.55152113
## tax 0.05204753 -0.06035758 1.08421286
## ptratio 0.15970219 0.18467971 -0.21682399
## black -0.15259442 0.07014070 0.10758938
## lstat 0.19126940 -0.24350027 0.38815400
## medv 0.23136379 -0.35076957 -0.10166842
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9598 0.0309 0.0093
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 2)
# boston_scaled is available
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- dim(boston_scaled)[1]
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 5 2 0
## med_low 5 17 5 0
## med_high 0 7 16 1
## high 0 0 0 24
The table above shows that our model predicts crime rates fairly well: it classifies the lowest and highest categories almost perfectly. In contrast, the model struggles with the middle classes. The earlier LDA plot shows med_low and med_high points lying near the low and high concentrations, which suggests the data might be better categorized into just low/high classes instead of four.
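The overall accuracy can be read off the cross tabulation as the share of observations on the diagonal. A sketch, re-entering the confusion matrix printed above by hand:

```r
# the confusion matrix from the output above, entered manually
tab <- matrix(c(20,  5,  2,  0,
                 5, 17,  5,  0,
                 0,  7, 16,  1,
                 0,  0,  0, 24),
              nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # about 0.75: three out of four districts classified correctly
```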
Calculating pairwise distances between observations is one way to measure their similarity: in the code below I standardize the original data and compute the Euclidean distance matrix, and then the Manhattan distance matrix. The matrices hold the pairwise distances between all observations (city districts), which would be too large to print in full, so I show short summary tables with descriptive statistics instead. The Manhattan distances (below) are generally longer, which is expected: the Manhattan metric sums coordinate-wise steps instead of taking the "straight path". (More from https://www.quora.com/What-is-Manhattan-Distance)
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method ="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
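The difference between the two metrics can be seen for a single pair of observations (a sketch using the first two rows of the scaled data):

```r
library(MASS)
data(Boston)
boston_scaled <- scale(Boston)  # matrix, rows are districts
a <- boston_scaled[1, ]
b <- boston_scaled[2, ]
# euclidean: length of the straight line between the two observations
sqrt(sum((a - b)^2))
# manhattan: sum of coordinate-wise absolute differences, never shorter
sum(abs(a - b))
```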
K-means is one of the most common clustering methods for categorizing data. It is used in many fields, such as image analysis, where pixel values can be clustered fairly easily with k-means. The success of the method depends on several aspects, such as the homogeneity of the groups. Below I run k-means with 4 clusters; note that the clustering itself is computed on the original (unscaled) Boston data, even though a scaled copy is created.
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster,main="Clusters = 4")
pairs(Boston[6:10], col = km$cluster,main="Clusters = 4")
Looking at the pairs plot, it seems rather clear that clustering the data into four groups does not give very good results: for most variables, two groups would do the trick. Let's try with 3 clusters and then with 2.
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster, main="Clusters = 3")
pairs(Boston[6:10], col = km$cluster, main="Clusters = 3")
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster, main="Clusters = 2")
pairs(Boston[6:10], col = km$cluster, main="Clusters = 2")
For the last part, we use the total within-cluster sum of squares (WCSS) to optimize the number of clusters. The optimal number is found where the WCSS drops sharply in the clusters-vs-WCSS plot. Without a fixed random seed the method produces a different solution on every run, since k-means places its initial centers randomly; set.seed(123) below makes the result reproducible.
library(MASS); library(dplyr, warn.conflicts = FALSE); library(ggplot2)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
The clusters-vs-WCSS plot shows that the optimal number of clusters seems to be 2. The line drops sharply at that point, and adding more clusters does not seem to add value to the clustering.
Statistical problems are often multidimensional: in order to understand different phenomena, it is necessary to condense many variables into fewer dimensions. This is what dimensionality reduction techniques aim for: we take the original variables and transform them into a few components that together capture as much of the variance in the original data as possible. These components make interpretation, plotting and so on easier and more informative.
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep =",", header = T)
dim(human)
## [1] 155 8
head(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## Parli.F
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The human data consists of 8 variables and 155 observations (countries). The variables describe the countries' Human Development Index: "The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions." (http://hdr.undp.org)
Edu2.FM (proportion of females with at least secondary education, relative to males)
Labo.FM (female labour force participation rate, relative to males)
Edu.Exp (expected years of schooling)
Life.Exp (life expectancy at birth)
GNI (gross national income per capita, $)
Mat.Mor (maternal mortality ratio)
Ado.Birth (adolescent birth rate)
Parli.F (share of parliamentary seats held by females)
library("GGally")
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
lowerFn <- function(data, mapping, ...) {
p <- ggplot(data = data, mapping = mapping) +
geom_point(color = 'blue', alpha=0.3, size=1) +
geom_smooth(color = 'black', method='lm', size=1,...)
p
}
g <- ggpairs(
data = human,
lower = list(
continuous = wrap(lowerFn) #wrap("smooth", alpha = 0.3, color = "blue", lwd=1)
),
upper = list(continuous = wrap("cor", size = 4))
)
g <- g + theme(
axis.text = element_text(size = 4),
axis.title = element_text(size = 4),
legend.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = NA),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey95")
)
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)
Correlations:
For the sake of clarity I chose one attribute which clearly correlates with other data variables and therefore explains the data well: Ado.Birth
The adolescent birth rate is the annual number of live births to adolescent women per 1,000 adolescent women. The adolescent birth rate is also referred to as the age-specific fertility rate for women aged 15-19
Edu2.FM: Strong negative correlation (-0.53) - in countries where women have children at a very young age, female participation in education is much lower.
Edu.Exp: Strong negative correlation (-0.70) - the expected years of schooling are very low where the adolescent birth rate is high, since young people leave the education process early.
Life.Exp: Strong negative correlation (-0.73) - expected individual lifespan decreases as the adolescent birth rate increases. Many issues are related: having children without a stable income or education is not a good starting point for the offspring.
GNI: Strong negative correlation (-0.56) - national income per capita decreases as Ado.Birth increases: less education, lower incomes.
Mat.Mor (maternal mortality ratio, i.e. maternal deaths per 100,000 live births): positive correlation - a higher adolescent birth rate goes together with higher maternal mortality, which makes sense, since both reflect weaker access to education and health care.
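The quoted correlations can be verified directly with cor() (a sketch re-using the same data URL as above):

```r
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
                    sep = ",", header = TRUE)
# correlations of the adolescent birth rate with the other indicators
round(cor(human)[, "Ado.Birth"], 2)
```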
Next we will perform PCA-analysis for the HDI-dataset. We will be solving the components that capture most of the variability in the data.
First we perform PCA with unstandardized data.
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
pca_human <- prcomp(human, scale=FALSE)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## Edu2.FM -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
biplot(pca_human)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Next we will run PCA and plot results with standardized data.
Definition of scale: "log simply takes the logarithm (base e, by default) of each element of the vector. scale, with default settings, will calculate the mean and standard deviation of the entire vector, then 'scale' each element by those values by subtracting the mean and dividing by the sd." (https://stackoverflow.com/questions/20256028/understanding-scale-in-r)
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
human_std <- scale(human)
pca_human <- prcomp(human_std, scale=T)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
biplot(pca_human)
As we can see from the figures above, PCA on unstandardized data can produce unwanted results, mainly because variables with the highest sample variances tend to dominate the first few principal components. Thus, principal component analysis on the covariance matrix should only be considered when all variables share the same unit of measurement.
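The scale dependence of PCA is easy to demonstrate with made-up data (a hypothetical two-variable sketch; the second variable differs from the first only in its unit):

```r
set.seed(1)
x <- rnorm(100)          # a variable on a unit scale
y <- 1000 * rnorm(100)   # independent noise on a much larger scale
m <- cbind(x, y)
# without scaling, the large-variance column dominates the first component
prcomp(m, scale = FALSE)$rotation[, "PC1"]
# with scaling, both variables receive comparable weight
prcomp(m, scale = TRUE)$rotation[, "PC1"]
```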
library(dplyr); library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
human_std <- scale(human)
pca_human <- prcomp(human_std, scale=T)
fviz_eig(pca_human)
s <- summary(pca_human)
print("Summary of PCA-analysis")
## [1] "Summary of PCA-analysis"
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca_pr <- round(1*s$importance[2, ], digits = 2)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.54 0.16 0.10 0.08 0.05 0.04 0.03 0.01
#str(as.data.frame(as.table(human_std)))
fviz_pca_var(pca_human, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Contributions of variables to PC1
fviz_contrib(pca_human, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(pca_human, choice = "var", axes = 2, top = 10)
pca_pr = pca_pr*100
pc_lab = paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "red"), xlab = pc_lab[1], ylab = pc_lab[2])
From the figure above, the first principal component (PC1) explains 54 % of the variance in the data. The second component (PC2) explains much less, 16 % in total.
As we can see from the biplot above, it seems that the data can be divided into three variable groups:
Edu.Exp, GNI, Edu2.FM, Life.Exp
Parli.F, Labo.FM
Mat.Mor, Ado.Birth
If we had to choose two data dimensions (variables) to explain the data, I would choose Life.Exp and Mat.Mor, since they have the largest contributions to the first principal component (bar chart above). Alternatively, the second dimension could be taken from the second component, for example Labo.FM, which contributes over 50 % of it.
library(FactoMineR);library(dplyr);library(MASS)
data(tea)
keep <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep)
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
The tea data has 300 observations and originally 36 variables; we chose 6 variables for further examination. Every attribute is categorical: two are binary, and the others have 3-4 classes.
library(FactoMineR);library(MASS);library(dplyr);library(tidyr);library(ggplot2)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep_columns)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
library(FactoMineR);library(MASS);library(dplyr);library(tidyr);library(ggplot2);library("factoextra")
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep_columns)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
#
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
fviz_contrib(mca, choice = "var", axes = 1, top = 10)
fviz_contrib(mca, choice = "var", axes = 2, top = 10)
plot(mca, invisible=c("ind"))
plot(mca, invisible=c("ind"),habillage = "quali")
fviz_mca_var(mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal())
fviz_mca_ind(mca,
label = "none", # hide individual labels
habillage = "how", # color by groups
palette = c("#00AFBB", "#E7B800","#d9b3ff"),
addEllipses = TRUE, ellipse.type = "confidence",
ggtheme = theme_minimal())
The conclusion of the MCA on the tea data is that no single dimension describes the data very well, as we can see from the first bar plot: the dimensions are roughly comparable, each explaining from about 15 % down to 5 % of the variance. In the first dimensions, most of the captured variance comes from the attributes "where" and "how", each explaining over 20 %.
In the variable-categories figure, we can recognize that the how- and where-related categories are well represented by our two MCA dimensions. The cos2 value measures how well a category is explained by the components produced by the MCA.
In the last figure, I plotted the observations in the MCA dimensions and colored them by the "how" variable: the data seems to separate fairly well into those three categories. The same phenomenon can be seen with the "where" variable.
The RATS data consists of rat body-weight measurements over a 9-week period. The rats were divided into three groups, each on its own nutritional diet. The aim of the study was to find differences between the diets through the weight development of the study rats.
The data has a total of 16 rats, divided unevenly among the diet groups: group 1 has 8 observations, while the other two groups have only 4 each.
Weight was measured at 11 time points (longitudinal, columns WD1-WD64). More detailed information is shown below.
library(dplyr)
library(tidyr)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36"
## [9] "WD43" "WD44" "WD50" "WD57" "WD64"
summary(RATS)
## ID Group WD1 WD8 WD15
## 1 : 1 1:8 Min. :225.0 Min. :230.0 Min. :230.0
## 2 : 1 2:4 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## 3 : 1 3:4 Median :340.0 Median :345.0 Median :347.5
## 4 : 1 Mean :365.9 Mean :369.1 Mean :372.5
## 5 : 1 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## 6 : 1 Max. :555.0 Max. :560.0 Max. :565.0
## (Other):10
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
##
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
##
glimpse(RATS)
## Observations: 16
## Variables: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 5...
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 5...
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 5...
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 5...
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 5...
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 5...
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 5...
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 6...
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 6...
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 6...
library(dplyr)
library(tidyr)
library(ggplot2)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
ggplot(RATSL, aes(x = Time, y = Weight, group = ID, color = Group)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "top")
Note From the individual rat growth profiles, we can see that the rats in diet group 1 are generally lighter than those in the other two groups. It is also easy to see that the group 1 rats gain much less weight than the rats in the other groups.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
pairs(RATS[, 3:13],
main = "Scatterplot matrix of repeated measures in rat growth data.")
Note From the scatterplot matrix we can see that there are dependencies between the measurements made at different points in time. This is to be expected: rats that are gaining weight will most likely keep gaining weight in the upcoming measurements as well.
Longitudinal data, where a response variable is measured on each subject on several different occasions, poses problems for analysis because the repeated measurements on each subject are very likely to be correlated rather than independent. (MABS4IODS, part VI)
Next we will fit a random intercept and random slope model with a Group x Time interaction to the rat growth data and compare it to the model without the interaction (ANOVA table).
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
g1 <- ggplot(RATSL, aes(x = Time, y = Weight, group = ID, color=Group)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Observed weight (grams)") +
theme(legend.position = "top")
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
anova(RATS_ref2, RATS_ref1)
## Data: RATSL
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time * Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2
## RATS_ref2 10 1185.9 1217.6 -582.93 1165.9 12.361 2 0.00207 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted <- fitted(RATS_ref2)
RATSL <- RATSL %>%
mutate(fitted = as.numeric(Fitted))
g2 <- ggplot(RATSL, aes(x = Time, y = fitted, group = ID, color=Group)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Fitted weight (grams)") +
theme(legend.position = "top")
plot_grid(g1, g2, labels = "AUTO")
Note From the ANOVA table we see that the interaction chi-squared statistic is 12.36 with 2 degrees of freedom. The associated p-value (0.00207 **) is very small, so we can conclude that the interaction model provides a better fit for the rat growth data (MABS4IODS).
Like the earlier RATS data, the BPRS (Brief Psychiatric Rating Scale) data contains measurements made over time. The data consists of 40 subjects who were rated on the BPRS on nine occasions, once a week from week 0 to week 8. The measurements are the bprs values.
Below is the summary of the dataframe.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2"
## [6] "week3" "week4" "week5" "week6" "week7"
## [11] "week8"
dim(BPRS)
## [1] 40 11
summary(BPRS)
## treatment subject week0 week1
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00
## Median :1.5 Median :10.50 Median :46.00 Median :41.00
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00
## week2 week3 week4 week5
## Min. :26.0 Min. :24.00 Min. :20.00 Min. :20.00
## 1st Qu.:32.0 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00
## Median :38.0 Median :36.50 Median :34.50 Median :30.50
## Mean :41.7 Mean :39.15 Mean :36.35 Mean :32.55
## 3rd Qu.:49.0 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00
## Max. :75.0 Max. :76.00 Max. :66.00 Max. :64.00
## week6 week7 week8
## Min. :19.00 Min. :18.0 Min. :20.00
## 1st Qu.:22.75 1st Qu.:23.0 1st Qu.:22.75
## Median :28.50 Median :30.0 Median :28.00
## Mean :31.23 Mean :32.2 Mean :31.43
## 3rd Qu.:37.00 3rd Qu.:38.0 3rd Qu.:35.25
## Max. :64.00 Max. :62.0 Max. :75.00
library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = as.numeric(scale(bprs))) %>%
ungroup()
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized bprs")
Note The tracking phenomenon can be seen from the standardized values of each observation. There are substantial individual differences, and variability appears to decrease with time. (MABS4IODS)
A better way to graphically represent the differences between observations is to plot the average profile of each treatment group.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = as.numeric(scale(bprs))) %>%
ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
ungroup()
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
geom_line() +
scale_linetype_manual(values = c(1,2)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(bprs) +/- se(bprs)")
From the mean response profiles of the two treatment groups in the BPRS data, we can see that there is only slight overlap between the profiles, which suggests that there are differences between the treatment groups.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = as.numeric(scale(bprs))) %>%
ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
ungroup()
ggplot(data = BPRSL, aes(x=weeks, y=bprs)) + geom_boxplot(aes(fill=treatment))
Note The mean profiles can also be shown graphically as boxplots with the treatment groups side by side. The plot shows that there may be differences between the treatment groups, and it also reveals some outliers (points). In addition, there is a clear decline in bprs values over time: the treatments seem to be working for the patients.
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
library(cowplot)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = as.numeric(scale(bprs))) %>%
ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
ungroup()
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
g1 <- ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
ggtitle("Original")+
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
BPRSL8S1 <- BPRSL8S %>%
filter(subject != 11)
g2 <- ggplot(BPRSL8S1, aes(x = treatment, y = mean)) +
geom_boxplot() +
ggtitle("Outlier filtered")+
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
plot_grid(g1, g2, labels = "AUTO")
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = as.numeric(scale(bprs))) %>%
ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
ungroup()
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
# Remove the outlier (subject 11), as in the boxplot chunk above
BPRSL8S1 <- BPRSL8S %>%
filter(subject != 11)
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = 0.50364, df = 36, p-value = 0.6176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.400946 7.308841
## sample estimates:
## mean in group 1 mean in group 2
## 36.15789 34.70395
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment 1 3.45 3.45 0.0557 0.8148
## Residuals 37 2292.97 61.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The t-test shatters my earlier conclusion (from the boxplots) that there would be statistically significant differences between the treatment groups: the p-value is 0.6176.
From the ANOVA table we can see that the mean bprs values of the subjects are strongly related to the baseline (week 0) bprs values. There is, however, still no statistically significant difference between the treatment groups: the p-value is 0.8148.
Comments on data
Gender
There seems to be slight variation in points between the genders:
The difference in mean points is (23.48214 - 22.32727) = 1.15.
Conclusion: males tend to get slightly higher points than females.
An interesting difference between the genders is that the attitude variable (which describes the general attitude toward statistics) is higher among males than females. Females, on the other hand, score higher on stra (strategic approach: organized studying and time management). Females might find a lack of purpose in statistics-related studies, and studying them might therefore involve unrelated memorising (surf).
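The group means above can be reproduced with a short dplyr chain; this is a sketch assuming the data is read into data_stud as in the summary at the top of the page.

```r
library(dplyr)

# Read the learning2014 data as at the top of the page
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep = ",")

# Mean exam points by gender
data_stud %>%
  group_by(gender) %>%
  summarise(mean_points = mean(points))
```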
Age
When looking at the age variable, one should note that there is an extremely high number of observations in the 20-30-year category: this may give a misleading picture of how age truly affects students' total points.
The correlation between points and age is negative.
Conclusion: the older the student, the lower the points. The correlation is not particularly strong, however.
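The age-points correlation can be computed directly; again a sketch assuming data_stud is read in as in the summary at the top of the page.

```r
# Read the learning2014 data as at the top of the page
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep = ",")

# Correlation between age and exam points (expected to be weakly negative)
cor(data_stud$age, data_stud$points)
```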
Attitude, Stra, Deep and Surf
Correlations between students' attitude, study approaches, and exam points:
The variable that best indicates students' points is, by far, attitude, with a correlation of 0.44.
Stra and surf give the second-best interpretations: students with a strategic approach tend to get better points, while surf (lack of purpose, unrelated memorising, syllabus-boundness) decreases students' points.
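All four correlations can be computed in one call; a sketch assuming data_stud is read in as in the summary at the top of the page.

```r
# Read the learning2014 data as at the top of the page
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep = ",")

# Correlations of exam points with attitude and the three study approaches
cor(data_stud[, c("attitude", "deep", "stra", "surf")], data_stud$points)
```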